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NORTHSTAR:  A  Parameter  Estimation 
Method  for  the  Spatial  Autoregression  Model 

Mete  Celik,  Baris  M.  Kazar,  Shashi  Shekhar,  Daniel  Boley,  David  J.  Lilja 


Abstract 

Parameter  estimation  method  for  the  spatial  autoregression  model  (SAR)  is  important  because  of 
the  many  application  domains,  such  as  regional  economics,  ecology,  environmental  management,  public 
safety,  transportation,  public  health,  business,  travel  and  tourism.  However,  it  is  computationally  very 
expensive  because  of  the  need  to  compute  the  determinant  of  a  large  matrix  due  to  Maximum  Likelihood 
Theory.  The  limitation  of  previous  studies  is  the  need  for  numerous  computations  of  the  computationally 
expensive  determinant  term  of  the  likelihood  function.  In  this  paper,  we  present  a  faster,  scalable  and 
NOvel  pRediction  and  estimation  TecHnique  for  the  exact  SpaTial  Auto  Regression  model  solution 
(NORTHSTAR).  We  provide  a  proof  of  the  correctness  of  this  algorithm  by  showing  the  objective 
function  to  be  unimodular.  Analytical  and  experimental  results  show  that  the  NORTHSTAR  algorithm 
is  computationally  faster  than  the  related  approaches,  because  it  reduces  the  number  of  evaluations  of 
the  determinant  term  in  the  likelihood  function. 

Index  Terms 

Spatial  Autoregression  Model,  Spatial  Autocorrelation,  Spatial  Data  Mining,  Spatial  Databases, 
Maximum  Likelihood  Theory. 
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I.  Introduction 

Given  a  spatial  framework,  observations  on  a  dependent  variable,  a  set  of  explanatory  variables, 
and  neighborhood  relationships  (spatial  dependencies)  among  the  spatial  data,  SAR  parameter 
estimation  based  on  Maximum  Likelihood  Theory  (ML)  aims  to  find  the  optimum  SAR  model 
parameters  by  minimizing  the  likelihood  function  of  the  SAR  model  solution. 

The  massive  sizes  of  geo-spatial  datasets  in  many  application  domains  make  it  important 
to  develop  scalable  parameter  estimation  algorithms  of  the  SAR  model  solutions  for  location 
prediction  and  classification.  These  application  domains  include  regional  economics  [24],  ecology 
[9],  [40],  environmental  management  [19],  public  safety  [21],  transportation  [41],  public  health 
[43],  business,  travel  and  tourism  [1],  [39],  [38].  For  example,  predicting  the  locations  of  the  bird 
nests  in  a  wetland  is  a  location  prediction  problem.  In  this  example  dependent  variable  can  be 
bird  nest  location  and  explanatory  variables  can  be  vegetation  durability,  water  depth,  vegetation 
distribution,  etc.  Initially  classical  prediction  model,  e.g.,  linear  regression  was  used  for  this 
problem  [29].  However,  it  yielded  low  prediction  accuracy  [29]  because  the  autocorrelation  in 
spatial  data  violates  the  independently  and  identically  distributed  (i.i.d.)  assumption  that  underlies 
linear  regression.  SAR  improved  prediction  accuracy  in  this  problem  [9],  [40]. 

However,  estimation  of  the  SAR  model  parameters  is  computationally  very  expensive  because 
of  the  need  to  compute  the  determinant  of  a  large  matrix  in  the  likelihood  function.  The  Maximum 
Likelihood  function  for  SAR  parameter  estimation  contains  two  terms,  namely  a  determinant 
term  and  SSE  term.  The  former  involves  computation  of  the  determinant  of  a  very  large  matrix, 
which  is  a  well-known  hard  problem  in  numerical  analysis.  For  example,  the  exact  SAR  model 
parameter  estimation  for  a  10,000-point  spatial  problem  can  take  tens  of  minutes  on  common 
desktop  computers.  Computation  costs  make  it  difficult  to  use  SAR  for  many  important  spatial 
problems  which  involve  millions  of  points.  Because  of  the  high  cost  of  determinant  computation, 
the  use  of  the  SAR  model  has  been  limited  to  small  problem  sizes,  despite  its  promise  to  improve 
prediction  and  classification  accuracy. 

Previous  approaches  compute  the  determinant  term  of  a  large  matrix  of  the  SAR  model 
solution  repeatedly  to  determine  the  Maximum  Likelihood  values  of  SAR  parameters,  namely, 
an  autocorrelation  parameter  and  weights  for  explanatory  variables  [26],  [32],  [33],  [34],  [22], 
[37],  For  example,  they  find  the  optimum  spatial  autocorrelation  parameter  using  iterative  search 
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methods,  (e.g.,  golden  section  search)  in  the  interval  of  possible  values  (e.g.,  [0,1]). 

In  contrast,  our  approach  yields  a  reduction  in  computation  cost  by  reducing  the  number 
of  determinant  computations  of  a  very  large  matrix.  The  key  idea  is  to  narrow  the  search 
interval  by  a  cheap  computation  yielding  an  upper  bound  on  the  spatial  autocorrelation  parameter 
(Lemma  4.3).  Recall  that  the  ML-based  SAR  model  solution  contains  two  terms,  a  determinant 
term  and  an  SSE  term.  Both  terms  involve  the  spatial  autocorrelation  parameter  and  both  terms 
are  unimodular  in  the  autocorrelation  function  (Theorem  4.1).  In  addition,  the  location  of  the 
autocorrelation  parameter  that  minimizes  the  SSE  term  of  the  ML  function  is  an  upper  bound  on 
the  autocorrelation  parameter  that  optimizes  the  likelihood  function.  This  upper  bound  allows 
us  to  narrow  the  search  interval  and  reduce  the  number  of  iterations  and  number  of  determinant 
evaluations  of  an  iterative  search  to  estimate  the  spatial  autocorrelation  parameter  of  the  SAR 
model.  Of  course,  there  is  a  trade-off  between  the  extra  computations  to  determine  the  upper 
bound  and  the  savings  from  reduced  number  of  iterations.  If  the  overhead  of  determining  the 
upper  bound  is  much  smaller  than  the  resulting  savings,  then  our  approach  is  computationally 
more  efficient  than  the  previous  approaches. 

The  paper  evaluates  the  proposed  approach  analytically  and  experimentally.  Analytical  and 
experimental  results  show  that  the  proposed  approach  is  computationally  more  efficient  than  the 
previous  work.  We  analyzed  that  the  evaluation  of  the  SSE  term  of  the  ML  function  gives  an  upper 
bound  on  the  autocorrelation  parameter  of  the  likelihood  function  by  using  SAR  Unimodularity 
Theorem  4.1  and  Lemma  4.3.  Experimental  results  show  that  the  computational  cost  of  the 
proposed  approach  is  usually  smaller  than  the  cost  of  related  approaches.  The  experiments  show 
that  the  proposed  approach  is  computationally  more  efficient  than  the  related  approaches  in 
terms  of  execution  time  and  memory  usage.  In  addition,  when  the  value  of  the  autocorrelation 
parameter  decreases,  the  advantage  of  the  proposed  approach  increases.  It  is  also  observed  that 
determinant  computation  saving  increases  for  the  bigger  neighborhood  structures. 

A.  An  Illustrative  Application  Domain 

We  now  introduce  an  example  which  will  be  used  throughout  this  paper  to  illustrate  the 
different  concepts  in  spatial  data  mining.  We  are  given  data  about  two  wetlands,  named  Darr 
and  Stubble,  on  the  shores  of  Lake  Erie  in  Ohio  USA  in  order  to  predict  the  spatial  distribution  of 
a  marsh-breeding  bird,  the  red-winged  blackbird  ( Agelaius  phoeniceus).  The  data  was  collected 
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(c)  Water  Depth  (d)  Distance  to  Open  Water 

Fig.  1.  (a)  The  geometry  of  the  wetland  and  the  locations  of  the  nests,  (b)  The  spatial  distribution  of  vegetation  durability  over 

the  marshland,  (c)  The  spatial  distribution  of  water  depth ,  and  (d)  The  spatial  distribution  of  distance  to  open  water. 


from  April  to  June  in  two  successive  years,  1995  and  1996  [29]. 

A  uniform  grid  was  imposed  on  the  two  wetlands  and  different  types  of  measurements  were 
recorded  at  each  cell  or  pixel.  In  total,  values  of  seven  attributes  were  recorded  at  each  cell. 
Domain  knowledge  is  crucial  in  deciding  which  attributes  are  important  and  which  are  not. 
For  example,  Vegetation  Durability  was  chosen  over  Vegetation  Species  because  specialized 
knowledge  about  the  bird-nesting  habits  of  the  red-winged  blackbird  suggested  that  the  choice 
of  nest  location  is  more  dependent  on  plant  structure  and  plant  resistance  to  wind  and  wave 
action  than  on  the  plant  species. 

For  simplicity,  we  focus  on  three  independent  attributes,  namely  Vegetation  Durability,  Dis¬ 
tance  to  Open  Water,  and  Water  Depth.  The  significance  of  these  three  variables  was  established 
using  classical  statistical  analysis.  The  spatial  distribution  of  these  variables  and  the  actual  nest 
locations  for  the  Darr  wetland  in  1995  are  shown  in  Figure  1.  These  maps  illustrate  the  following 
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(a)  Pixel  Property  with  Indepen-  (b)  Random  Nest  Locations 

dent  Identical  Distribution 


Fig.  2.  Spatial  distribution  satisfying  random  distribution  assumptions  of  classical  regression 


two  important  properties  inherent  in  spatial  data. 

1)  The  value  of  attributes  which  are  referenced  by  spatial  location  tend  to  vary  gradually  over 
space.  While  this  may  seem  obvious,  classical  data  mining  techniques,  either  explicitly  or 
implicitly,  assume  that  the  data  is  independently  generated.  For  example,  the  maps  in 
Figure  2  show  the  spatial  distribution  of  attributes  if  they  were  independently  generated. 
Ozesmi  et  al.  has  applied  classical  data  mining  techniques  like  logistic  regression  [29]  and 
neural  networks  [28]  to  build  spatial  habitat  models.  Logistic  regression  was  used  because 
the  dependent  variable  is  binary  (nest/no-nest)  and  the  logistic  function  “squashes”  the 
real  line  onto  the  unit-interval.  The  values  in  the  unit-interval  can  then  be  interpreted  as 
probabilities.  The  study  concluded  that  with  the  use  of  logistic  regression,  the  nests  could 
be  classified  at  a  rate  24%  better  than  random  [28]. 

2)  The  spatial  distributions  of  attributes  sometimes  have  distinct  local  trends  which  contradict 
the  global  trends.  This  is  seen  most  vividly  in  Figure  1(b),  where  the  spatial  distribution 
of  Vegetation  Durability  is  jagged  in  the  western  section  of  the  wetland  as  compared  to 
the  overall  impression  of  uniformity  across  the  wetland.  This  property  is  called  spatial 
heterogeneity. 

Classification  accuracy  achieved  by  classical  and  spatial  regression  are  compared  on  the  test 
data.  Receiver  Operating  Characteristic  (ROC)  [14]  curves  can  be  used  to  compare  classification 
accuracy.  ROC  curves  plot  the  relationship  between  the  true  positive  rate  (TPR)  and  the  false 
positive  rate  (FPR).  For  each  cut-off  probability  b,  TPR(fi)  measures  the  ratio  of  the  number 
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of  sites  where  the  nest  is  actually  located  and  was  predicted  divided  by  the  number  of  actual 
nest  sites.  The  FPR  measures  the  ratio  of  the  number  of  sites  where  the  nest  was  absent  but 
predicted,  divided  by  the  number  of  sites  where  the  nests  were  absent.  The  ROC  curve  is  the 
locus  of  the  pair  (TPR(Z?),  FPR(Z?))  for  each  cut-off  probability.  The  higher  the  curve  above  the 
straight  line  TPR=FPR,  the  better  the  accuracy  of  the  model. 

Figure  3(a)  illustrates  the  ROC  curves  for  spatial  autoregression  regression  (SAR)  and  classical 
regression  models  built  using  the  real  surveyed  Darr95  learning  data  and  Figure  3(b)  displays 
the  ROC  curve  for  the  real  Stubble  test  data  [39].  It  is  clear  that  using  spatial  regression  resulted 
in  better  predictions  at  all  cut-off  probabilities  relative  to  the  classical  regression  model. 

Clearly,  by  including  a  spatial  autocorrelation  term,  there  is  substantial  and  systematic  im¬ 
provement  for  all  levels  of  cut-off  probability  on  both  the  learning  data  (1995  Darr)  and  test 
data  (1995  Stubble). 


ROC  Curve  for  learning  data(Darr  95) 


False  Positive  Rate 


ROC  Curve  for  testing  data(Stubble  marshland  1995) 


(a)  ROC  curves  for  learning 


(b)  ROC  curves  for  testing 


Fig.  3.  (a)  Comparison  of  the  classical  regression  model  with  the  spatial  autoregression  model  on  the  Darr  learning  data,  (b) 

Comparison  of  the  models  on  the  testing  data. 


B.  Problem  Statement 

Given  a  spatial  framework  S  for  the  underlying  spatial  graph  G,  and  the  attribute  functions 
fXk  over  S,  and  the  neighborhood  relationship  R,  we  can  build  the  SAR  model  and  find  its 
parameters  by  minimizing  the  objective  (log-likelihood)  function  as  can  be  seen  in  (1). 
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t{p\y)=  —  in|i-pW| 
n  ' - v - - 

log—det 

+  ln((I  -  pW)y)T(I  -  x(xTx)“1  xT)T(I  -  x(xTx)“1  xT)((I  -  pW)y)  (1) 

v-  ✓ 

SSE 

The  details  of  the  derivation  of  the  log-likelihood  function  for  the  ML-based  SAR  model  is 
given  in  Appendix  VI. 

The  problem  of  parameter  estimation  of  the  SAR  model  using  Maximum  Likelihood  Theory 
(ML)  is  formally  defined  as  follows: 

Given: 

•  A  spatial  framework  S  consisting  of  sites  {.S| ..... .sr, }  for  an  underlying  geographic  space 

G. 

•  A  collection  of  explanatory  functions  fXk  :  S  — >■  Rk,  k  =  1  ,..nA'.  Rk  is  the  range  of 
possible  values  for  the  explanatory  functions. 

•  A  dependent  function  /y  :  S  — )►  _Ry. 

•  A  family  F  (i.e.,  y  =  pWy  +  x/i  +  f)  of  learning  model  functions  mapping  R1  x  ...  x  RK  ->• 
Ry. 

•  A  neighborhood  relationship  on  the  spatial  framework. 

Find: 

•  The  SAR  scalar  parameters  p  and  the  regression  coefficient  vector  [3 

Objective: 

•  Minimizing  the  objective  function,  log-likelihood  function  i{p |y)  given  in  (1),  of  the  ML- 
based  SAR  model  solution. 

Constraints: 

•  Geographic  space  S'  is  a  multi-dimensional  Euclidean  Space. 

•  The  values  of  the  explanatory  functions,  the  fXk ’s  and  the  response  function  fy  may  not  be 
independent  with  respect  to  those  of  nearby  spatial  sites,  i.e.,  spatial  autocorrelation  exists. 

•  The  domain  Rk  of  explanatory  functions  is  the  one-dimensional  domain  of  real  numbers. 

•  The  domain  of  the  dependent  variable,  Ry  =  {0, 1). 

•  The  SAR  parameter  p  varies  in  the  range  [0, 1). 


•  The  error  is  normally  distributed  (Gaussian  error),  i.e.,  e  ~  iV(0,cr2I)  III).  In  other  words, 
the  error  is  composed  of  normally  distributed  random  numbers  with  unit  standard  deviation 
and  zero  mean. 

•  The  neighborhood  matrix  W  exhibits  sparsity. 


For  the  bird  location  prediction  example,  dependent  variable  y  can  be  the  locations  of  the 
nests.  Explanatory  variables  x  can  be  independent  variables,  namely  observations  of  Vegetation 
Durability ,  Water  Depth,  and  Distance  to  Open  Water.  Neighborhood  matrix  W  represents  the 
spatial  dependencies  of  the  neighboring  locations.  In  the  W  matrix,  neighboring  locations  are 
represented  by  Is,  and  the  rest  of  the  matrix  contains  a  value  of  zero. 

TABLE  i 

The  notation  used  in  this  study 


Variable 

Definition 

Variable 

Definition 

P 

The  spatial  autoregression  (autocorrela¬ 
tion)  parameter 

I 

Identity  matrix 

y 

n-by-1  vector  of  observations  on  the  de¬ 
pendent  variable 

€ 

n-by-1  vector  of  unobservable  error 

X 

n-by-k  matrix  of  observations  on  the  ex¬ 
planatory  variable 

bi 

Lower  bandwith  of  the  neighborhood  ma¬ 
trix 

w 

n-by-n  neighborhood  matrix  that  accounts 

for  the  spatial  relationships  (dependencie 

s)  among  the  spatial  data 

bu 

Upper  bandwith  of  the  neighborhood  ma¬ 
trix 

k 

Number  of  features 

tol 

Tolerance  value 

0 

/c-by-1  vector  of  regression  coefficients 

\ 

Eigenvalue  of  a  matrix 

n 

Problem  size  (number  of  observation 

points  or  pixels) 

<r2 

The  common  variance  of  the  error  e 

C.  Related  Work 

The  Maximum  Likelihood  Theory  (ML)  is  used  in  order  to  estimate  the  SAR  model  param¬ 
eters.  The  ML  function  (log-likelihood)  of  the  SAR  model  solution  given  in  (1)  is  computed 
by  calculating  the  maximum  of  the  sum  of  the  logarithm  of  the  determinant  (log-det  term)  of  a 
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large  matrix  and  a  sum-of-squared  errors  ( SSE )  term  [26],  [32],  [33],  [34],  [22],  [37] .  The  ML- 
based  SAR  model  solutions  can  be  classibed  into  two  categories,  exact  SAR  model  solutions 
and  approximate  SAR  model  solutions,  due  to  the  strategy  used  to  calculate  the  log-det  term  of 
a  large  matrix.  This  paper  focuses  on  ML-based  exact  SAR  model  solutions. 

To  estimate  the  parameters  of  a  ML-based  SAR  model  solution,  the  log-likelihood  function 
can  be  constructed,  as  shown  in  (1).  The  details  of  the  derivation  of  the  log-likelihood  function 
for  the  ML-based  SAR  model  is  given  in  Appendix  VI.  The  log-likelihood  function  of  the  ML- 
based  SAR  model  solution  basically  contains  two  terms,  namely,  a  log-det  term  and  an  SSE  term 
as  can  be  seen  in  (1).  The  estimation  procedure  involves  computation  of  the  logarithm  of  the 
determinant  of  (log-det)  a  large  matrix,  i.e.  (I  —  pW).  Computing  the  determinant  of  a  matrix 
is  very  expensive. 

In  the  literature,  there  are  two  ML-based  exact  SAR  model  solutions,  an  eigenvalue  com¬ 
putation  (EV)  based  solution  [26]  and  a  direct  (straight)  sparse  log-det  (SLD)  based  solution 
[30].  The  EV-based  SAR  model  solution  uses  dense  data  structures  to  find  the  determinant 
of  a  very  large  matrix.  Because  of  the  dense  representation  of  the  matrices  in  the  EV-based 
approach,  LU  factorization  of  a  large  matrix  requires  0(n3)  operations,  where  n  is  the  number 
of  observations.  LU  factorization  is  used  to  compute  determinant  of  the  large  matrix  [13],  [16]. 
This  leads  to  high  execution  time  and  memory  usage.  In  the  SAR  formulation,  neighborhood 
matrix  W  is  sparse.  Pace  and  Barry  proposed  an  SLD-based  SAR  model  solution  which  uses 
sparse  LU  factorization  using  sparse  data  structures  [30].  The  number  of  operations  of  sparse 
LU  factorization  is  0(2 nbubi),  where  bu  and  bi  correspond  to  the  upper  and  lower  bandwidths  of 
the  neighborhood  matrix  W.  Using  sparse  data  structures  drastically  decreases  the  computation 
time  and  memory  usage.  However,  even  if  sparse  data  structures  are  used,  the  computation  of 
the  computationally  expensive  log-det  term  of  the  log-likelihood  function  must  be  repeated  in 
the  parameter  estimation  process  of  the  SAR  model  (Figure  4).  As  a  result,  ML-based  exact 
SAR  solutions  in  the  literature  exhibit  high  computational  cost  and  thus  are  not  scalable  to  large 
problem  sizes. 

In  contrast,  we  limit  the  search  space  of  the  computationally  expensive  determinant  compu¬ 
tation  of  the  log-likelihood  function  by  finding  an  upper  bound  on  the  spatial  autocorrelation 
parameter.  First,  we  calculate  the  computationally  efficient  term  ( SSE  term)  of  the  log-likelihood 
function  for  finding  an  upper  bound  on  the  spatial  autocorrelation  parameter  and  then,  we  limit  the 
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Input: 

( Pstarutol ,  W) 

Output:  (Popt,  Popt,  vlpi} 

Step: 

1. 

Step  (i)  { 

2. 

Popt=GSS(range 

=  [o,i], 

3. 

start  = 

■  {p  start } , 

4. 

f loglike 

=  =£  •  In  |I  -  pW|  +  SSE) 

5. 

Compute  (, 0opt,a 

2opt)  } 

6. 

return  (popt,  /3opt,  crlpt) 

Fig.  4.  The  pseudocode  of  the  EV-based  and  SLD-based  SAR  model  solutions.  The  only  difference  of  the  related  works  is 
deciding  whether  to  use  sparse  data  structures  or  not.  SLD-based  solution  uses  sparse  data  structures  and  EV-based  solution 
uses  dense  data  structures.  GSS  stands  for  golden  section  search. 


number  of  evaluations  of  the  computationally  expensive  term  (log-det  term)  of  the  log-likelihood 
function  using  this  upper  bound  to  find  the  optimum  SAR  model  autocorrelation  parameter.  The 
proposed  algorithm  (NORTHSTAR)  promises  to  reduce  the  computational  cost  and  to  scale  to 
large  problem  sizes. 

D.  Contributions 

Major  contributions  of  this  study  include  the  following: 

1)  We  developed  a  faster,  scalable  and  NOvel  pRediction  and  estimation  TecHnique  for  the 
exact  SpaTial  AutoRegression  model  solution  (NORTHSTAR).  In  the  first  step  of  our 
approach,  the  SAR  model  parameters  are  estimated  using  the  much  less  computationally 
complex  sum-of-squared  errors  ( SSE )  term  of  the  log-likelihood  function.  A  second  com¬ 
putationally  more  complex  step  is  required  only  if  the  parameters  obtained  in  the  first 
step  are  not  in  the  desired  precision;  in  this  case,  the  log-det  term  is  embedded  into  the 
estimation  process. 

2)  We  analytically  showed  that  the  estimated  SAR  model  parameter  obtained  after  the  first 
step  can  be  used  as  an  upper  bound  in  the  second  step  based  on  SAR  Unimodularity 
Theorem  4.1  and  Lemma  4.3,  if  the  second  step  is  necessary. 
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3)  We  experimentally  showed  that  the  proposed  heuristic,  NORTHSTAR,  is  computationally 
more  efficient  and  scalable  (in  terms  of  memory  usage  and  CPU  time)  than  the  previous 
work,  i.e.,  the  eigenvalue  (EV)  based  and  straight  log-det  (SLD)  based  approaches. 

E.  Outline  of  the  Paper  and  Scope 

The  remainder  of  the  paper  is  organized  as  follows:  Section  II  presents  the  theory  of  the 
SAR  model.  The  proposed  approach,  NORTHSTAR,  for  the  SAR  model  solution  is  presented  in 
section  III.  Section  IV  gives  the  analysis  of  the  NORTHSTAR  algorithm  and  section  V  presents 
experimental  evaluations  of  the  proposed  algorithm.  We  conclude  and  summarize  the  paper  with 
a  discussion  of  future  work  in  section  VI. 

This  paper  focuses  on  developing  a  new  ML-based  exact  SAR  model  solution,  NORTHSTAR. 

II.  Basic  Concepts:  Spatial  Autoregression  (SAR)  Model 

The  SAR  model  [11],  [2],  also  known  in  the  literature  as  the  spatial  lag  model  [2]  or  mixed 
regressive  model  [31],  is  an  extension  of  the  linear  regression  model  and  is  given  in  (2). 


y  =  pWy  +  x/3  +  e  (2) 

In  the  equation,  y  is  the  n-by-l  vector  of  observations  on  the  dependent  variable,  where  n 
is  the  number  of  observation  points;  p  is  the  spatial  autoregression  parameter;  W  is  the  n- 
by  -n  neighborhood  matrix  that  accounts  for  the  spatial  relationships  (dependencies)  among  the 
spatial  data;  x  is  the  ri-by-k  matrix  of  observations  on  the  explanatory  variable,  where  k  is  the 
number  of  features;  /3  is  a  k-by- 1  vector  of  regression  coefficients;  and  e  is  an  ?r-by-l  vector  of 
unobservable  error  which  is  assumed  to  be  generated  from  independent  and  identical  standard 
normal  distribution.  Spatial  autocorrelation  term  pWy  is  added  to  the  linear  regression  model 
in  order  to  model  the  strength  of  the  spatial  dependencies  among  the  elements  of  the  dependent 
variable,  y.  Data  structures  of  the  SAR  equation  can  be  seen  in  Figure  5.  Construction  of  the 
neighborhood  matrix  W  is  discussed  in  Appendix  I  for  regular  and  irregular  grid  spaces. 

The  solution  procedure  for  the  SAR  equation  is  decided  to  be  more  complex  than  that  for 
the  linear  regression  equation  because  of  the  presence  of  the  pWy  term  on  the  right  side  of 
the  equation.  Also  notice  that  the  W  matrix  is  quadratic  in  size  relative  to  the  size  of  the  data 
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y 


W  y  x  p 


I1  I 


n-  by-1  1  -by- 1  n-by-n  n- by-1  n-by-k  k-by-\  K-by-1 


Fig.  5.  Data  structures  of  SAR  Model 


samples.  Fortunately,  very  few  entries  of  W  are  non-zero,  and  sparse  matrix  techniques  are  used, 
which  exploit  this  fact,  to  speed  up  the  solution  process. 

III.  Proposed  Approach 

This  section  describes  the  new  ML-based  exact  SAR  model  solution,  NORTHSTAR,  and  then 
discusses  the  design  decisions. 

A.  NORTHSTAR  Algorithm 

The  NORTHSTAR  algorithm  aims  to  decrease  the  number  of  computations  of  the  computa¬ 
tionally  expensive  log-det  term  of  the  log-likelihood  function  which  is  given  in  (1)  by  finding 
an  upper  bound  on  the  spatial  autocorrelation  parameter.  In  the  first  step  of  the  algorithm, 
an  upper  bound  on  the  spatial  autocorrelation  paremeter  is  estimated  using  a  computationally 
more  efficient  SSE  term  of  the  log-likelihood  function  of  the  SAR  model.  In  the  second  step, 
the  computationally  more  expensive  log-det  term  is  embedded  into  the  estimation  process.  The 
second  step  (of  the  NORTHSTAR  algorithm)  uses  the  upper  bound  on  the  spatial  autocorrelation 
parameter,  found  in  the  first  step,  to  narrow  the  search  space  and  to  decrease  the  number  of 
determinant  evaluations  of  a  large  matrix. 

The  pseudocode  of  the  NORTHSTAR  algorithm  is  given  in  Figure  6,  where  GSS  stands 
for  golden  section  search.  Instead  of  the  golden  section  search,  which  is  not  dependent  to  the 
derivative  of  the  optimized  function,  a  derivative-based  search  algorithm  can  be  used  for  faster 
convergence  to  the  optimal  SAR  parameter  p,  but  it  is  necessary  to  compute  the  inverse  of  a 
large  matrix  (I  —  pW),  which  is  as  costly  as  the  determinant  computation  of  a  large  matrix. 
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Input:  (Pstarutol,  W) 

Output:  (Papf,/3opt,cr^,f) 

Steps: 

1.  Step  (i)  { 

2.  pinit-est=GSS{range  =  [0, 1], 

3.  Start  =  { Pstart }, 

4.  f loglike  =  SSE )  } 

5.  Step  (ii)  { 

6.  p0pf—GSS(raTig€  —  [0?  pinit—  esf]> 

2-  start  {Pinit— est }  > 

8.  fioglike  =  =£  ■  In  |I  -  pW|  +  SS£) 

9.  Compute  (/3OJrf,  <7^t)  } 

10.  return  (popt ,  j3opt ,  <?oPt) 


Fig.  6.  The  NORTHSTAR  algorithm. 


B.  Design  Decisions 

The  design  decisions  for  the  NORTHSTAR  algorithm  consist  of  choosing,  the  range  of  the 
SAR  autocorrelation  parameter  p  and  neighborhood  structure,  (i.e.,  sparse  vs.  dense  neighborhood 
matrix). 

1 )  The  Range  of  SAR  Autocorrelation  Parameter  p:  The  range  of  the  p  parameter  affects 
the  performances  of  the  SAR  algorithms  since  it  determines  the  search  space  of  the  algorithm. 
Lemmas  3.1  and  3.2  helps  in  the  optimization  of  the  SAR  model  parameters  by  ensuring  that 
the  SAR  parameter  p  will  be  between  a  -1  and  1  interval  ,  thereby  reducing  the  search  space  of 
the  SAR  parameter  p. 

Lemma  3.1:  Neighborhood  matrix  W  has  real  eigenvalues,  regardless  of  the  neighborhood 
topology,  as  long  as  the  neighborhood  relation  is  symmetric  (i.e.  if  i  is  a  neighbor  of  j,  then  j 
is  a  neighbor  of  i). 

Proof:  Let  A  be  the  adjacency  matrix  for  the  neighborhood  graph  for  a  domain:  al:l  =  1  if 
and  only  if  nodes  i  and  j  are  neighbors  (i.e.  are  correlated).  All  other  off-diagonal  entries  and 
all  the  diagonal  entries  are  all  zero. 
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The  matrix  W  is  obtained  by  scaling  the  rows  of  A  so  that  the  entries  in  each  row  of  W 
add  up  to  one.  That  means  that  W  =  DA  for  some  diagonal  scaling  matrix  D.  The  matrix  D 
is  not  only  diagonal,  but  all  the  diagonal  entries  are  strictly  positive  (because  every  node  has 
at  least  one  neighbor).  Hence  the  square  root  of  D  is  well-defined,  and  so  is  the  inverse  of  D. 
So  it  can  be  seen  that  W  is  diagonally  similar  to  a  symmetric  matrix  and  hence  must  have  real 
eigenvalues.  Specifically,  D-1/2WD+1/2  =  D+1/2AD+1/2  and  it  is  symmetric  and  similar  to 
W.  ■ 

Lemma  3.2:  The  eigenvalues  of  the  row-stochastic  (i.e.,  row-normalized,  row-standardized 
or  Markov)  neighborhood  matrix  W  are  in  the  range  [—1,  +1]  (see  also  Theorem  5.3  in  §2.5  on 
page  49  of  [4],  §5.13.3  in  [27]). 

Proof:  All  eigenvalues  of  a  row-stochastic  matrix  is  bounded  by  1  in  absolute  value  by 
Perron-Frobenius  theorem  (please  see  page  32  of  [4]  and  page  120  of  [10]).  ■ 

It  is  also  possible  to  put  bounds  on  SAR  autocorrelation  parameter  p.  In  this  study,  we  used 
one  of  the  terms  of  the  log-likelihood  function  of  the  SAR  model  to  define  an  upper  bound  on 
the  SAR  autocorrelation  parameter  p. 

2)  Neighborhood  Structure:  Neighborhood  matrices  are  used  to  model  the  spatial  depen¬ 
dencies  of  given  objects.  Matrices  can  be  constructed  on  regular  or  irregular  grid  spaces  (Ap¬ 
pendix  I).  Although  it  is  possible  to  use  neighborhood  structures  on  irregular  grid  space  on  the 
NORTHSTAR  algorithm,  in  this  study,  we  used  a  two-dimensional  regular  grid  space  with  a  four- 
neighborhood.  We  also  compared  the  performances  of  the  algorithm  for  different  neighborhood 
structures. 

The  use  of  sparse  or  dense  representation  of  the  neighborhood  matrices  affects  the  execution 
time  very  much  since  the  dense  LU  factorization  (decomposition)  which  is  used  to  find  the  deter¬ 
minant  of  a  matrix  requires  n3  operations  while  the  sparse  version  needs  only  2 nbubi  operations 
[13],  [16],  where  bu  and  bi  correspond  to  the  upper  and  lower  bandwidths  of  the  neighborhood 
matrix  W  respectively.  In  this  study,  we  used  sparse  representation  of  neighborhood  matrices. 

IV.  Analysis  of  the  NORTHSTAR  Algorithm 
A.  Is  NORTHSTAR  correct? 

In  this  section,  we  show  that  the  SAR  log-likelihood  function,  given  in  (1),  is  unimodular  by 
developing  SAR  Unimodularity  Theorem  4.1.  We  also  show  that  SAR  autocorrelation  parameter 
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p  (pinit-est),  minimizing  the  SSE  term  is  an  upper  bound  of  SAR  autocorrelation  parameter  p 
minimizing  the  log-likelihood  function  by  developing  Lemma  4.3. 

Lemmas  3.1  and  3.2,  described  in  the  previous  section,  helps  reduce  the  search  space  of  the 
SAR  parameter  p  into  the  interval  of  [-1,+1],  Based  on  Lemmas  3.1  and  3.2,  we  describe  two 
Lemmas  (Lemma  4.1  and  Lemma  4.2).  These  lemmas  define  a  function  f(x)  which  has  a  form 
similar  to  that  of  the  exponential  of  SAR  log-likelihood  function  and  prove  that  there  is  only 
one  zero  of  such  a  function  in  the  interval  of  (-1,+1).  Then,  using  Lemma  4.1  and  Lemma  4.2, 
a  SAR  Unimodularity  Theorem  is  developed  which  shows  that  the  SAR  log-likelihood  function 
has  at  most  one  zero  within  the  interval  of  (-1,+  1). 

Lemma  4.1:  Let 


f(x)  = 


p(x ) 


q(x)n /2’ 

where  p(x)  is  a  polynomial  of  degree  n  with  all  real  distinct  zeroes  r\  <  . . .  <  rn  outside  the 
open  interval  (—1, 1),  and  q(x)  is  a  polynomial  of  degree  2  that  is  positive  for  all  real  x.  Then 
f'(x)  has  exactly  one  zero  in  each  open  interval  (ry,  ri+i),  i  =  1, . . .  ,  n  —  1.  In  particular,  f(x) 
has  at  most  one  zero  in  the  interval  (—1, 1),  and  hence  f(x)  is  unimodular  in  (—1, 1). 

Proof:  By  a  straightforward  computation,  the  derivative  of  /  is 


fix)  = 


p'q-^pq1  N(x) 


i/2+l 


D(x)' 


N (x)  is  a  polynomial  of  degree  at  most  rc+1.  For  polynomials,  p(x)  =  anxn+an- \Xn  - \-olq. 


p'q  -  \pc(  =  7„+i:r"+1  +  7 nxn  H - h  7o, 


and  q(x)  =  [52x2  +  +  Po  the  polynomial  N(x 

we  have  that  the  leading  coefficient  is  jn+i  =  nan  ■  62  —  an  ■  |2 f3-2  =  0.  Hence  N(x)  is  actually 
a  polynomial  of  degree  at  most  n. 

Now  we  localize  the  n  zeroes  of  the  polynomial  N(x).  At  each  point  r^,  N(rj)  =  p'(ri)q(ri)  f 
0  has  the  same  sign  as  p'(ri)  f  0,  for  i  =  1, . . .  ,  n.  Since  the  zeroes  of  p(x)  are  distinct,  the 
signs  of  p'(rj)  alternate,  hence  so  do  the  signs  of  N(rj).  Thus,  polynomial  N(x)  must  have 
at  least  one  zero  in  each  open  interval  (?y,?y+ 1),  for  1  —  1,. . .  ,n  —  1.  Call  these  zeroes  Sj, 
i  =  1, . . .  ,7i  —  1.  This  accounts  for  n—l  zeroes,  leaving  only  one  left.  Denote  this  one  left  over 
zero  s*. 

Because  all  the  7’s  are  real,  s*  must  be  real.  Within  each  interval  (?y,  ri+ 1),  there  must  be  an 


odd  number  of  zeroes  due  to  the  sign  changes,  but  we  have  only  one  zero  left  over,  so  the  extra 
zero  s*  must  satisfy  either  s*  <  r\  or  s*  >  r„. 
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Since  no  r,  is  in  the  interval  (—1,1),  f(x)  can  have  at  most  one  zero  within  (—1,1).  ■ 


Next,  we  extend  Lemma  4.1  by  taking  the  zero  eigenvalues  of  the  neighborhood  matrix  W 
into  account,  which  gives  us  Lemma  4.2. 

Lemma  4.2:  Let 


f(x)  = 


p(x ) 


q(x)n/2' 

where  p(x)  is  a  polynomial  of  degree  d  <  n  with  all  real  zeroes  r  \  <  ...  <  rd  outside  the  open 
interval  (—1, 1),  and  q(x)  is  a  polynomial  of  degree  2  that  is  positive  for  all  real  x .  Then  either 
f(x)  is  unimodular  in  (—1, 1)  or  there  is  a  unimodular  function  arbitrarily  close  to  /. 

Proof:  Let  e  be  chosen  arbitrarily  such  that  0  <  e  <  and  let  cr  j  =  1,2,...  be 
arbitrarily  distinct  but  small  numbers  such  that  c:l  <  e  for  all  j  and  c:l  >  0  for  j  >  d.  Write 


p(x)  =  ad(x  —  rf)  ■  ■  ■  (x  —  rd).  Define  the  ??-th  degree  polynomial  p( x)  =  ad(x  —  ri  +  ei)  ■  ■  ■  (x  — 

p(x) 

q(x)n/2  ‘ 


rd  +  ed)(  1  —  Cd+ix)  •••(!  —  enx )  to  be  a  “slight”  perturbation  of  p(x).  Define  f(x)  — 


The  function  f(x)  satisfies  Lemma  4.1,  so  is  unimodular  in  (—1, 1). 

By  construction,  \p(x)  —  p(x)\  <  e  ■  \e(x)\  where  e(x)  is  some  polynomial  of  degree  n  in  x, 
independent  of  e  (we  are  using  the  fact  that  e  <  |).  Hence  in  the  interval  (—1,1) 

maxxe(_i:i)  \e(x)\ 


I  f(x)  ~  f(x)  |  <  e 


nrm 


in^ef-ip)  q{x)n/'2' 


Due  to  the  Lemma  4.2,  function  f(x)  cannot  have  two  distinct  maxima  in  the  open  interval 
(—1,1),  because  one  would  not  be  able  to  find  a  unimodular  function  arbitrarily  close  to  it, 
unless  function  f(x)  is  constant. 

Theorem  4.1:  SAR  Unimodularity  Theorem:  The  log-likelihood  function  i{p |y)  as  a  function 
of  p  is  unimodular  for  p  E  (—1,1). 

Proof:  As  the  exponential  of  log-likelihood  function  i(p |y)  has  the  same  form  of  function 
f(x)  defined  in  Lemma  4.2,  it  directly  follows  that  SAR  likelihood  function  i{p |y)  is  unimodular. 


Since  the  log-likelihood  function  l{p |y)  is  unimodular,  the  golden  section  search  algorithm 
always  finds  the  global  minimum  of  the  log-likelihood  function.  Thus,  we  have  an  optimal 
parameter  estimation  for  the  ML-based  SAR  model  solutions.  We  plotted  the  SAR  log-likelihood 
function  i{p |y)  in  order  to  see  its  extrema  for  a  problem  size  of  2500  in  Figure  7.  As  can  be 
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Fig.  7.  The  components  of  the  log-likelihood  function  for  a  thematic  class  of  satellite  dataset  with  2500  observation  points 
and  the  log-likelihood  function  =  log-det  +  SSE.  ”rho”  stands  for  the  SAR  parameter  p. 


seen,  the  SAR  log-likelihood  function  is  unimodular  and  its  components  log-det  and  SSE  terms 
are  also  unimodular. 

Lemma  4.3:  The  initial  estimate  of  the  p  parameter  calculated  using  the  SSE  term 

of  the  log-likelihood  function  i{p |y),  in  the  first  step  (step(i))  of  NORTHSTAR  is  an  upper 
limit  on  the  location  of  autocorrelation  parameter  p  optimizing  the  SAR  log-likelihood  function 

?{p\y)- 

Proof:  For  most  of  the  data  mining  problems  of  interest,  spatial  autoregression  parameter 
p  is  in  the  interval  [0,1],  i.e.  0  <  p  <  1. 

Let  us  assume  that  functions//  and  /2  are  unimodular,  have  minimas  in  the  interval  [-1,+1]  and 
that  the  minima  of fl  is  less  than  or  equal  to  the  minima  of  /2  such  that  minima(/7)  <  minima(/2). 
In  that  case,  function  fl+f2  will  also  be  unimodular  and  will  have  a  minima  between  the  minima 
of  fl  and/2  such  that  minima(/7)<  mini  mat/'/ +/2)  <  minima(/2). 

Since  the  log-likelihood  function  i{p |y)  is  unimodular  and  both  the  log-det  and  SSE  terms 
are  also  unimodular,  the  minima  of  the  log-likelihood  function  i{p |y)  is  between  the  minimas 
of  the  log-det  term  and  the  SSE  term.  We  need  to  prove  that  the  minima  of  the  log-det  term  is 
less  than  or  equal  to  the  minima  of  the  SSE  term  and  the  minima  of  the  SSE  term  is  an  upper 
bound  for  the  log-likelihood  function. 
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For  a  given  0<p<l,  and  neighborhood  matrix  W  with  K  symmetric  pairs, 


log-det  =  In  |I  —  pW|  =  A'ln(l  —  p2)  (3) 

To  find  the  minima  of  the  log-det  term,  we  need  to  find  the  derivation  of  the  log-det  term  and 
to  set  the  derivation  at  0. 


d(log-det)  =  -2  pK 
d(p)  1  —  p2 

then  p  =  0,  which  is  the  minima  of  the  log-det  term. 

In  that  case,  the  minima  of  the  SSE  term  will  be  greater  than  or  equal  to  the  actual  p  value 
and  0  (which  is  the  minima  of  the  log-det  term)  and  also  it  gives  the  upper  bound  on  the  p 
value,  such  that  minimafSAF)  >  minima^ (p|y))  >  minima(log-det  term). 

Figure  7  shows  the  SAR  autocorrelation  parameter  p  values  which  are  minimizing  the  log-det 
term,  SSE  term  and  log-likelihood  function  i{p |y).  It  can  be  seen  that  both  the  log-det  and  SSE 
terms  are  unimodular  in  the  open  interval  (-1,+  1).  Thus,  the  minima  of  the  likelihood  function 
i{p |y)  is  between  the  minima  of  the  log-det  term  and  the  minima  of  the  SSE  term.  It  can  also 
be  seen  from  Figure  7  that  the  value  of  the  minima  of  the  log-det  term  is  less  that  the  value 
of  the  minima  of  the  SSE  term.  This  observation  shows  that  the  minima  of  the  SSE  term  is  an 
upper  bound  of  the  minima  of  the  sum  of  the  SSE  and  the  log-det  term  (The  sum  of  both  the 
terms  is  the  log-likelihood  function  i{p |y)  as  given  in  (1)).  ■ 

B.  How  Computationally  Expensive  is  the  Proposed  Algorithm-NORTHSTAR? 

In  this  section,  we  show  that  the  magnitude  of  the  log-det  term  of  the  SAR  log-likelihood 
function,  given  in  (1),  is  very  small  with  respect  to  the  magnitude  of  the  SSE  term  of  the  SAR  log- 
likelihood  function  i{p |y)  by  developing  the  Relative  Magnitude  Observation  (Observation  4.1). 
The  algorithms  in  the  previous  studies  calculate  the  optimum  SAR  parameters  using  the  log-det 
and  SSE  terms  in  the  estimation  procedure  at  the  same  time.  It  is  observed  that  if  the  magnitude 
of  the  SSE  term  is  bigger  enough  than  the  magnitude  of  the  log-det  term,  it  is  possible  that 
the  effect  of  the  log-det  term  in  the  calculations  will  be  dominated  by  the  magnitude  of  the 
SSE  term,  especially  when  the  problem  size  is  big.  In  such  cases,  the  SSE  term  itself  may  be 
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enough  to  find  optimal  p  parameter  and  there  may  be  no  need  to  include  the  log-det  term  in 
the  SAR  parameter  estimation  process.  The  NORTHSTAR  algorithm  is  designed  based  on  this 
observation  and  described  in  Lemma  4.4.  In  the  first  step  of  the  NORTHSTAR,  the  SSE  term 
is  used  to  find  the  optimal  p  parameter;  if  the  desired  precision  of  the  optimal  p  parameter  is 
not  enough,  a  second  step  is  required.  In  the  second  step,  the  log-det  term  is  included  in  the 
estimation  procedure  and  optimal  p  parameter,  found  in  the  fist  step,  is  used  as  an  upper  bound 
on  optimum  SAR  autoroccelation  parameter  p. 

Observation  4.1:  Relative  Magnitude  Observation:  The  log-det  term  of  the  SAR  log-likelihood 
function  i(p |y),  given  in  (1),  is  very  small  in  magnitude  with  respect  to  the  magnitude  of  the 
SSE  term  of  the  log-likelihood  function. 

Proof:  The  magnitude  (absolute  value)  of  the  log-det  term  of  the  log-likelihood  function 
given  in  (5)  is  a  function  of  the  p  parameter  and  p  <  1. 

—2  ”  \ 

—  y^lnfl  — pA,:)  <  abs  (— 2  ln(l  A  p)) 

11 U  J 

(5) 

It  can  be  seen  that  the  magnitude  of  the  log-det  term  is  determined  by  the  value  of  the  p 
value.  In  the  extreme  case,  p  can  be  maximum  1  and  the  value  of  equation  given  in  (5)  can  be 
approximately  1.4. 

In  contrast,  the  magnitude  (absolute  value)  of  the  SSE  term  of  the  log-likelihood  function 
f(p |y)  is  a  function  of  problem  size  n,  values  of  the  eigenvalues  and  dependent  vector  y  as  can 
be  seen  in  (6). 


-2 

abs (log-det  term)  =  abs  —  In  II  —  pWl 

'n 


abs(SSE  term)  =  abs  (ln((I  -  pW)y)T(I  -  M)T  (I  -  M)  ((I  -  pW)y)) 

=  abs  (ln(yrA  y))  =  n  *  E[Xf  yl  +  ...  +  yl)\  (6) 

where  M  =  x(xTx)-1xT,  A  =  ((I  —  pW)T(I  —  M)T(I  —  M)(I  —  pW).  E[.]  represents  the 
expected  value,  which  is  the  average  of  the  all  eigenvalues  in  (  6). 

It  can  be  seen  that  the  magnitude  of  the  SSE  term  is  much  bigger  than  the  magnitude  of  the 
log-det  term  (i.e.  (SSE  term)  (log-det  term)),  especially  when  the  problem  size  is  big  and  the 
norm  of  the  dependent  vector  y  is  big. 
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Details  of  the  proof  can  be  seen  in  Appendix  V.  ■ 

Table  II  shows  the  magnitudes  of  the  log-det  and  SSE  terms  at  the  optimal  p  value  where 
log-likelihood  function  i{p |y)  is  minimum  for  different  neighborhood  structures. 

TABLE  II 

The  magnitudes  of  the  log-det  and  SSE  TERMS  at  THE  OPTIMAL  p  VALUE  where  the  log-likelihood  function 

IS  MINIMUM 


Problem  Size(??,) 

Neighborhood 

popt 

abs(Log-likelihood) 

abs(SSE) 

abs(log-det) 

2500 

4-N 

0.467 

15.185 

15.125 

0.061 

2500 

8-N 

0.430 

15.267 

15.238 

0.028 

Lemma  4.4:  Let  the  9  ratio  be  the  ratio  of  the  magnitude  of  the  log-det  term  and  the  SSE 
term. 


abs(maxp (log-det  term)) 
a6s(min|p|<i(55£'  term)) 

If  the  9  ratio  is  small  enough,  there  is  no  need  to  include  the  log-det  term  in  the  estimation 
procedure  of  the  NORTHSTAR  algorithm.  Recall  that  the  9  ratio  will  be  small  when  the  problem 
size  gets  bigger  and  the  norm  of  y  vector  is  big. 

Proof:  The  Relative  Magnitude  Observation  (Observation  4.1)  proves  that  the  magnitude  of 
the  SSE  term  is  much  bigger  than  the  magnitude  of  the  log-det  term.  In  the  worst  case,  p  will  be 
close  to  1  and  the  absolute  value  of  the  log-det  term  will  be  close  to  1.4.  In  contrast,  the  absolute 
value  of  the  SSE  term  will  increase  with  the  increasing  problem  size  and  increasing  magnitude 
of  the  norm  of  vector  y.  In  this  case,  9  will  be  small  enough  and  the  effect  of  the  log-det  term 
will  be  small,  or  even  negligible,  on  the  log-likelihood  function  calculation.  Because  of  this 
property  of  the  9  ratio,  it  can  be  used  as  a  stopping  criteria  of  the  NORTHSTAR  algorithm  and 
may  eliminate  the  need  for  numerous  computations  of  the  determinant  of  a  large  matrix.  ■ 

This  leads  to  our  NORTHSTAR  heuristic,  which  can  be  defined  as  follows: 

•  (i)  Pinit-est  =  value  of  approximate  p  (ignoring  log-det  term) 

•  (ii)  Popi  =  value  of  p  approximated  in  the  second  step  of  NORTHSTAR  In  this  step  p, 
is  used  as  an  upper  bound. 

The  cost  of  the  NORTHSTAR  algorithm  is  dominated  by  the  sparse  LU  factorization  operation 
which  is  used  to  calculate  determinant  of  (I  — pW).  The  cost  of  it  will  be  (j  —  m)(2nbubi)  +  9n2  + 
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2j  —  3,  where  rn  is  the  savings  from  the  log-det  computation  when  there  is  no  stopping  criteria. 
When  Lemma  4.4  the  applied  to  the  algorithm,  the  cost  of  the  algorithm  will  be  dependent  on 
the  function  of  the  9  value,  such  that  (j  —  f(6))(2nbubi)+9n2 +  2j  —  3.  It  should  be  noted  that  the 
f(8)  value  will  be  close  to  the  rn  value  for  small  problem  sizes.  In  contrast,  f(6)  will  be  close 
to  j  for  big  problem  sizes  because  of  the  huge  savings  from  the  log-det  computation  (such  that 
f(8)  >>  in).  The  parameters  bu  and  bi  correspond  to  the  upper  bandwidth  and  lower  bandwidth 
of  the  neighborhood  matrix  W  respectively.  The  parameter  ( j  —  rn)  is  the  number  of  log-det 
computations  for  the  NORTHSTAR  algorithm.  Next,  we  compare  the  cost  of  NORTHSTAR  with 
the  related  approaches. 

C.  How  Does  NORTHSTAR  Cost  Compare  with  Related  Approaches? 

This  section  presents  the  cost-modeling  of  the  exact  SAR  model  solutions.  The  total  computa¬ 
tional  complexity  (the  operation  counts)  of  our  NORTHSTAR  heuristic  is  listed  in  Figure  8  and 
it  should  be  noted  that  j  <C  n.  The  eigenvalue  computation  based  SAR  model  solution  cannot  go 
beyond  problem  sizes  of  10 K  due  to  memory  constraints.  The  parameters  bu  and  bi  correspond 
to  the  upper  and  lower  bandwidths  of  the  neighborhood  matrix  W  respectively.  The  first  terms 
of  the  NORTHSTAR  and  SLD  cost  functions  (  ( j  —  f(9))(2nbubi)  and  j(2nbubi)  +  9 n2  +  j), 
respectively  )  are  the  costs  of  the  sparse  LU  factorization  which  needs  to  be  calculated  for  each 
p  until  it  reaches  its  optimum  value  popt.  The  rest  of  the  cost  functions  of  the  NORTHSTAR  and 
SLD  algorithms  are  the  costs  of  sparse  matrix- vector  multiplication  of  the  GSS  algorithm.  The 
first  term  of  the  cost  function  of  the  EV  is  dense  LU  factorization  and  the  rest  is  the  dense  matrix- 
vector  multiplication  of  the  GSS  algorithm.  The  EV-based  approach  is  more  expensive  than  the 
others  because  of  the  dense  LU  factorization  ( n 3  operation).  The  NORTHSTAR  algorithm  is 
more  efficient  than  the  SLD  algorithm,  since  it  decreases  the  number  of  computations  of  sparse 
LU  factorization.  f(8)  represents  the  savings  from  the  log-det  computation. 

Thus,  for  large  problem  sizes,  NORTHSTAR  is  much  more  computationally  efficient  than  the 
SLD  and  EV-based  approaches. 

V.  Experimental  Evaluation 

We  compared  the  NORTHSTAR  algorithm  with  the  EV-based,  and  SLD-based  solutions  using 
real  and  synthetic  datasets  to  estimate  SAR  model  parameters.  It  should  be  noted  that  the 
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Problem  Size 

NORTHSTAR 

EV-based  Approach 

SLD-based  Approach 

n 

(j  ~  f(9))(2nbubi)  +  9 n2  +  2 j  -  3 

|  n3  +  529 n2  +  j 

j(2nbubi)  +  9  n2  +  j 

Fig.  8.  The  total  computational  complexity  (the  operation  counts)  of  our  NORTHSTAR  heuristics  with  respect  to  the  previous 
exact  SAR  model  implementations.  The  variable  f(6)  is  the  number  of  savings  from  log-det  computations.  ”SLD”  stands  for 
the  straight  log-det  approach  and  ”EV”  stands  for  the  eigenvalue  approach. 


algorithms  give  the  same  SAR  parameter  estimates  since  all  of  them  are  the  exact  SAR  model 
solution. 

A.  Experimental  Design  and  System  Setup 

We  conducted  experiments  to  answer  the  following  questions: 

•  What  is  the  execution  time  and  memory  usage  of  the  proposed  algorithm? 

•  What  is  the  effect  of  the  value  of  the  SAR  parameter  p  on  the  NORTHSTAR  algorithm? 
What  is  the  behavior  of  the  NORTHSTAR  algorithm  for  varying  p  parameters?  How  does 
the  precision  of  the  predicted  p  parameter  affect  the  savings  from  log-det? 

•  What  is  the  behavior  of  the  NORTHSTAR  algorithm  for  different  problem  sizes? 

•  What  is  the  behavior  of  the  NORTHSTAR  algorithm  for  different  neighborhood  structures? 

•  What  is  the  effect  of  the  9  ratio  over  log-det  savings? 

The  control  parameters  for  the  experiments  are  summarized  in  Table  III.  Notable  solutions  for 
the  SAR  model  have  been  implemented  in  Matlab  [25].  The  system  setup  of  the  experiments  is 
shown  in  Figure  9. 

B.  Datasets 

1 )  Real  datasets:  We  used  real  datasets  from  ecology  and  satellite  remote-sensing  image  data 
in  order  to  evaluate  the  SAR  model  solutions. 

The  ecology  data  is  used  to  predict  bird  nest  locations,  as  explained  in  Section  I. A. 

The  satellite  remote-sensing  data  is  used  for  thematic  classification.  The  study  site  encompasses 
Carlton  County,  Minnesota,  which  is  approximately  20  miles  southwest  of  Duluth,  Minnesota. 
The  region  is  predominantly  forested,  composed  mostly  of  upland  hardwoods  and  low-land 
conifers.  There  is  a  scattering  of  agriculture  throughout.  The  topography  is  relatively  flat,  with 
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TABLE  III 

The  experimental  design 


Factor  Name 

Parameter  Domain 

Problem  Size  (n) 

100,  400,  900,  1600,  2500,  10,000,  160,000  and  1,000,000 

observation  points 

Neighborhood 

Structure 

2-D  with  4-neighbors,  8-neighbors,  and  12-neighbors 

Candidates 

Eigenvalue  Based  Approach,  Straight  Log-det  Based  Ap¬ 
proach,  and  NORTHSTAR 

SAR  Parameter  p 

[0,1) 

Optimization 

Non-derivative  Based  Optimization 

Dataset 

Real  and  Synthetic  Datasets 

Hardware 

IBM  Regatta  and  IBM  Netfinity  Linux  Cluster 

the  exception  of  the  eastern  portion  of  the  county  containing  the  St.  Louis  River.  Wetlands, 
both  forested  and  non-forested,  are  common  throughout  the  area.  The  largest  city  in  the  area  is 
Cloquet,  a  town  of  about  10,000.  For  this  study  we  used  a  spring  Landsat  7  scene,  taken  May 
31,  2000.  This  scene  was  clipped  to  the  Carlton  County  boundaries,  which  resulted  in  an  image 
of  size  1343  lines  by  2043  pixels  and  6-bands.  Out  of  this  we  took  a  subset  image  of  1200  by 
1800  to  eliminate  boundary  zero- valued  pixels.  This  translates  to  a  W  matrix  of  size  2.1  million 
x  2.1  million  (2.1M  x  2.1M)  points.  The  observed  variable  x  is  a  matrix  of  size  2.1M  by  6.  We 
chose  nine  thematic  classes  for  the  classification. 

In  thematic  classification,  the  goal  is  to  categorize  the  pixels  of  satellite  images  into  classes 
(e.g.,  water,  urban,  rural,  forest,...)  based  upon  the  values  of  the  ’’spectral  signatures”  recorded 
by  receivers  on  board  the  satellite.  The  problem  of  thematic  classification  has  deep  spatial 
connections  because  in  most  instances,  pixels,  which  are  neighbors  on  the  image,  belong  to  the 
same  class.  Thus  satellite  images  naturally  exhibit  high  spatial  autocorrelation  if  pixel  sizes  are 
smaller  than  the  size  of  spatial  features. 

2)  Synthetic  dataset  generation:  Synthetic  datasets  were  generated  using  standard  normal 
distribution  with  unit  standard  deviation  and  zero  mean  for  different  problem  sizes,  such  as 
?z=  1 00,  400,  900,  1600,  2500  and  for  different  p  parameters,  such  as  p=0.1,  0.2,  0.3,  ...,  0.9. 
Observation  variable  n-by-k  x  and  unobservable  error  n-by-1  e  were  generated  using  standard 
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Fig.  9.  The  flow  diagram  for  experiments 


normal  distribution  for  k= 5  explanatory  variables.  Regression  coefficients  k-by-1  6  vector  were 
taken  as  one  vector.  Using  these  parameters,  dependent  variable  y  was  generated  for  different 
problem  sizes  and  p  parameters. 

C.  Experimental  Results 

1 )  Scalability  and  memory  usage  of  the  algorithms:  We  compared  the  execution  time  and 
memory  usage  of  the  EV-based,  SLD-based,  and  NORTHSTAR  algorithms  for  different  problem 
sizes  using  the  real  dataset. 

Results  showed  that  the  NORTHSTAR  algorithm  is  faster  than  EV-based  and  SLD-based 
approaches,  especially  when  the  problem  size  is  increased,  because  of  the  log-det  savings  of  the 
NORTHSTAR  algorithm  (Table  IV). 

The  memory  usage  of  the  SLD-based  and  NORTHSTAR  algorithms  is  very  low  due  to  the 
sparse  representation  of  the  neighborhood  matrix  W  as  a  sparse  matrix  (Table  IV).  However, 
this  is  not  possible  for  the  EV-based  approach  since  it  has  to  use  the  dense  representation  of  the 
matrix.  Results  showed  that  NORTHSTAR  is  the  most  scalable  algorithm  among  the  exact  SAR 
model  solutions,  when  execution  time  and  memory  usage  are  considered. 

Since  the  execution  time  and  memory  usage  of  the  EV-based  approach  is  too  high,  we 
compared  only  the  SLD-based  and  NORTHSTAR  algorithms  in  the  rest  of  the  experiments. 
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TABLE  IV 

The  execution  time  and  memory  usage 


Problem  Size(??,) 

Execution  Time 

EV-based  App. 

SLD-based  App. 

NORTHSTAR 

400x400  (160,000) 

Intractable 

32  minutes 

24  minutes 

1000x1000  (1,000,000) 

Intractable 

72  hours 

45  hours 

Problem  Size(??.) 

Memory  (MB) 

Exact  EV 

Exact  SLD 

NORTHSTAR 

50x50  (2,500) 

50 

1 

1 

100x100  (10,000) 

2400 

4.5 

4.5 

400x400  (160,000) 

~  6.14  *  108 

70 

70 

1000x1000  (1,000,000) 

~  8*  10® 

450 

450 

2 )  Effect  of  the  SAR  parameter  p:  We  conducted  experiments  to  characterize  the  behavior  of 
the  NORTHSTAR  algorithm  for  varying  desired  precision  and  varying  values  of  the  p  parameter. 

Effect  of  the  desired  precision  of  the  p  parameter.  We  examined  the  effect  of  the  desired 
precision  of  the  p  parameter  for  problem  size  2500  of  the  SLD  and  NORTHSTAR  algorithms.  We 
used  the  real  dataset  (e.g.,  satellite  remote- sensing  dataset)  and  an  optimum  SAR  autocorrelation 
parameter  p  of  4.7293  *  10-1  for  this  dataset.  In  the  stage  (ii)  of  the  NORTHSTAR  algorithm 
set  the  starting  point  for  searching  optimal  p  value  as  (2*Pin*-esi) .  it  is  observed  that  when  the 
precision  decreased,  the  savings  from  the  log-det  computation  of  the  NORTHSTAR  increases 
(Figure  10). 

Effect  of  value  of  p  parameter.  We  conducted  experiments  to  determine  the  behavior  of  the 
NORTHSTAR  algorithm  and  compared  the  log-det  savings  of  the  SLD-based  and  NORTHSTAR 
algorithms  for  various  p  parameters.  Synthetic  datasets  were  produced  using  standard  normal 
distribution  for  different  p  parameters  for  the  fixed  problem  size  2500.  For  each  p  parameter, 
experiments  were  conducted  10  times  and  the  number  of  log-det  savings  in  Figure  11  shows 
the  average  of  these  10  runs.  Results  showed  that  if  the  value  of  the  spatial  autocorrelation 
parameter  p  is  low,  NORTHSTAR  algorithm  outperforms  SLD-based  approach  and  if  it  is  high, 
there  may  be  no  significant  savings  from  the  log-det  computation  (Figure  11).  It  is  a  fact  that 
when  the  p  parameter  is  close  to  1  (which  is  the  theoretical  upper  bound),  the  upper  bound  on 
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Fig.  10.  The  savings  from  log-det  computation.  ”rho”  stands  for  the  SAR  parameter  p 


rho  parameters  Problem  size 


Fig.  11.  Effect  of  value  of  p  parameter  Fig.  12.  Effect  of  problem  size 


the  SAR  autocorrelation  parameter  found  in  the  first  step  of  the  NORTHSTAR  will  be  close  to 
1  (Figure  11)  and  will  have  no  significant  effect  to  limit  the  search  space  of  the  second  step 
of  the  NORTHSTAR  algorithm.  In  such  a  case,  NORTHSTAR  will  behave  like  the  SLD-based 
approach.  For  SLD-based  approach,  the  bound  of  the  p  parameter  is  [0,1].  For  the  NORTHSTAR 
algorithm  the  bound  of  the  p  parameter  in  stage(i)  is  [0,1]  and  in  stage(ii)  [0,pinit\  is  the 
result  of  the  stage(i)). 

3)  Effect  of  problem  size:  :  We  conducted  experiments  to  see  the  behavior  of  the  NORTH- 
STAR  algorithm  for  different  problem  sizes  and  compared  it  with  SLD-based  approach  algorithm. 
In  the  experiments,  synthetic  dataset  are  used  for  problem  sizes  100,  400,  800,  1600,  and 
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Fig.  13.  2-D  regular  grid  space  and  Fig.  14.  Effect  of  different  neighborhood  structures 

neighborhoods  of  the  center  cell 


2500.  Datasets  are  produced  using  standard  normal  distribution  for  a  fixed  SAR  autocorrelation 
parameter  p  such  that  0.3.  Experiments  showed  that  NORTHSTAR  outperforms  SLD-based 
approach  and  the  number  of  log-det  computation  is  constant  with  the  increasing  problem  size 
for  SLD-based  approach  and  NORTHSTAR  (Figure  12).  Since  each  log-det  computation  will  be 
costly,  constant  saving  will  be  valuable,  especially,  with  the  increasing  problem  size. 

4)  Effect  of  neighborhood  structure:  :  We  conducted  experiments  using  different  neighborhood 
structures  to  determine  behavior  of  NORTHSTAR  algorithm.  Real  dataset  is  used  for  problem 
size  2500.  4-neighbors,  8-neighbors,  and  12-neighbors  structures  are  used  for  2-D  regular  grid 
space  (Figure  13).  For  the  2-D  regular  grid  space,  the  4-neighbors  of  a  cell  (center  cell  in  Figure 
13)  can  be  defined  as  the  cells  which  are  found  on  four  direction  of  it,  such  as  the  North  (N), 
South  (S),  East  (E),  and  West  (W).  For  the  2-D  regular  grid  space,  the  8-neighbors  of  a  cell 
(center  cell  in  Figure  13)  can  be  defined  as  the  cell  which  are  found  on  eight  directions  of  it, 
such  as  N,  NW,  W,  SW,  S,  SE,  E,  and  NE.  For  the  2-D  regular  grid  space,  the  12-neighbors 
include  second  North  (N2),  second  West  (W2),  second  (E2),  and  second  South  (S2)  cells  in 
addition  to  the  8-neighbors  cells. 

Experiments  showed  that  when  the  neighborhood  structure  increased  number  of  log-det  compu¬ 
tation  of  SLD  does  not  change,  although  number  of  the  log-det  computation  of  the  NORTHSTAR 
algorithm  decreases. 

5)  Effect  of  6  Ratio:  Experiments  showed  that  if  the  magnitude  of  the  log-det  term  is  less 
than  or  equal  to  ^  of  the  magnitude  of  the  SSE  term  computed  in  step(i)  of  NORTHSTAR,  then 
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the  p  value  that  our  NORTHSTAR  heuristic  finds  in  its  step  (i)  is  within  the  ±0.1  range  of  the 
optimal  p  value.  In  other  words,  if  the  desired  precision  is  ±0.1,  one  would  not  need  to  run  the 
second  step  of  NORTHSTAR,  where  we  compute  the  computationally  expensive  log-det  term. 

VI.  Conclusions  and  Future  Work 

Linear  regression  is  one  of  the  best-known  classical  data  mining  techniques.  However,  it  makes 
the  assumption  of  independent  identical  distribution  ( i.i.d .)  in  learning  data  samples,  which  does 
not  work  well  for  geo-spatial  data,  which  is  often  characterized  by  spatial  autocorrelation.  In 
the  SAR  model,  spatial  dependencies  within  data  are  taken  care  of  by  the  autocorrelation  term, 
and  the  linear  regression  model  thus  becomes  a  spatial  autoregression  model.  Incorporating 
the  autocorrelation  term  enables  better  prediction  accuracy.  However,  computational  complexity 
increases  due  to  the  need  for  computing  the  determinant  of  a  large  matrix  (I  —  pW). 

The  related  work  computes  determinant  term  of  a  large  matrix  of  SAR  model  solution  repeat¬ 
edly  to  determine  optimum  values  of  SAR  parameters,  namely,  autocorrelation  parameter  and 
weights  for  explanatory  variables. 

Since  the  determinant  computation  of  a  large  matrix  is  computationally  expensive,  we  devel¬ 
oped  a  faster,  scalable  and  novel  prediction  and  estimation  technique  for  an  exact  SAR  model 
solution  (NORTHSTAR)  which  is  based  on  sparse  LU  decomposition  and  aims  to  recude  the 
number  of  determinant  computation  of  a  very  large  matrix  .  This  yields  to  reduce  computation 
cost  of  the  SAR  parameter  estimation  process.  The  key  idea  is  narrow  the  search  interval  by  a 
cheap  computation  yielding  an  upper  bound  on  the  spatial  autocorrelation  parameter  (Lemma 
4.3).  In  the  paper  we  proved  that  both  terms  of  the  ML-based  SAR  model  solution  (determinant 
term  and  SSE  term)  are  unimodular  and  contains  SAR  autocorrelation  term  (Theorem  4.1).  In 
addition,  the  location  of  autocorrelation  parameter  minimizing  the  SSE  term  of  ML  function  is 
an  upper  bound  on  the  location  of  autocorrelation  parameter  optimizing  the  likelihood  function 
(Lemma  4.3).  This  upper  bound  allows  us  to  narrow  the  search  interval  and  reduce  the  number 
of  iterations  and  number  of  determinant  evaluations  of  an  iterative  search  to  estimate  the  spatial 
autocorrelation  parameter  of  SAR  model. 

Analytical  and  Experimental  evaluations  showed  that  our  approach  is  computationally  more 
efficient  than  related  work.  We  analyzed  that  the  evaluation  of  the  SSE  term  of  ML  function 
gives  an  upper  bound  on  autocorrelation  parameter  of  the  likelihood  function  by  using  SAR 
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Unimodularity  Theorem  4.1  and  Lemma  4.3.  Experimental  results  show  that  the  computational 
cost  of  proposed  approach  is  usually  smaller  than  the  of  related  approaches.  The  experiments 
show  that  proposed  approach  is  computationally  more  efficient  than  related  approaches  in  terms 
of  execution  time  and  memory  usage.  The  experiments  show  that  when  the  value  of  autocorrela¬ 
tion  parameter  gets  smaller,  the  advantage  of  the  proposed  approach  increases.  It  is  also  observed 
that  determinant  computation  saving  increases  for  the  bigger  neighborhood  structures. 

We  have  two  main  items  for  future  work: 

1)  Regarding  bounds  on  the  parameter  p,  we  will  investigate  ways  to  eliminate  computing 
some  of  the  eigenvalues  to  reach  very  high  problem  sizes  with  the  eigenvalue  approach. 

2)  We  plan  to  reduce  the  bandwidth  of  the  W  matrix  for  big  neighborhood  structures.  This 
will  help  us  obtain  a  more  efficient  solution  for  all  eigenvalues  using  sparse  matrix  algebra. 
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Appendix  I 

Constructing  Neighborhood  Matrix 
A.  Constructing  Neighborhood  Matrix  (W )  on  Regular  (Uniform)  Grid  Space. 

Several  previous  studies  have  shown  that  modeling  of  spatial  dependency  during  the  predic¬ 
tion  and  classification  process  improves  overall  prediction  and  classification  accuracy.  Spatial 
dependency  can  be  defined  by  the  relationship  between  spatially  adjacent  pixels  in  a  small 
neighborhood.  The  spatial  relationship  among  locations  in  a  spatial  framework  is  often  mod¬ 
eled  via  a  neighborhood  (contiguity)  matrix.  A  simple  neighborhood  matrix  may  represent  the 
neighborhood  relationship  defined  using  adjacency,  Euclidean  distance,  etc.  Example  definitions 
of  neighborhood  using  adjacency  include  2-neighborhood  on  1 -dimensional  grid  space  and  4- 
neighborhood,  8-neighborhood,  12-neighborhood  and  so  on  neighborhood  on  2-dimensional  grid 
space.  This  structure  is  also  known  as  regular  square  tessellation  1 -dimensional  and  2-dimensional 
planar  surface  partitioning  [17].  One  can  use  Moran’s  I  index  in  order  to  see  whether  there  is 
significant  spatial  dependency  in  the  given  dataset  (attributes).  Appendix  II  summarizes  Moran’s 
I  index  computation. 

The  rows  (neighboring  values)  of  neighborhood  matrix  W  sum  to  1,  which  means  that  W  is 
row-standardized  i.e.,  row-normalized  or  row-stochastic.  A  non-zero  entry  in  the  jth  column  of 
the  ith  row  indicates  that  the  jth  observation  will  be  used  to  adjust  the  prediction  of  the  ith  row 
where  i  is  not  equal  to  j. 

To  form  the  row-normalized  neighborhood  matrix  W,  first  a  non-normalized  neighborhood 
matrix  C  formed  by  putting  putting  ”l”s  for  neighborhoods  of  (i,j)th  pixel  of  the  spatial 
framework  and  by  putting  zeros  for  the  rest  of  the  entries.  Then,  non-zero  row  elements  of 
C  matrix  are  divided  by  each  row  sum  of  it.  The  algebraic  equivalent  of  this  definition  can  be 
formulized  as  W  =  D_1C  where  D  is  a  diagonal  matrix  whose  diagonal  elements  contains  row 
sums  of  matrix  C  and  the  rest  of  the  elements  of  the  D  matrix  is  zero,  that  is  da  =  c>j 
and  djj  =  0.  In  other  words,  W  matrix  is  formed  by  dividing  non-zero  elements  of  C  by 
corresponding  diagonal  element  of  D. 

Next,  illustration  of  the  neighborhood  matrix  formation  on  4-by-4  regular  grid  space  using 
4-neighborhood  spatial  relationships  is  discussed. 
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1 )  Illustration  of  the  Neighborhood  Matrix  Formation  on  a  4-by-4  Regular  Grid  Space: 
Spatial  dependency  can  be  defined  by  the  relationships  among  spatially  adjacent  pixels  in  a 
small  neighborhood  within  a  spatial  framework  that  is  a  regular  grid  space.  Given  a  gridded 
spatial  framework,  the  4-neighborhood  assumes  that  a  pair  of  locations  influence  each  other  if 
they  share  an  edge. 

For  the  4-neighborhood  case,  the  neighbors  of  the  ( i,j)th  pixel  of  the  regular  grid  are  the 
pixels  which  are  found  NORTH,  SOUTH,  EAST,  and  WEST  side  of  it  as  shown  in  Figure  15. 


r 


neighbor s(i,  j)  =  < 


(*  -  !o) 
(i,j  +  1) 
(*  +  !,  j) 

(hJ  ~  !) 


2  >  i  >  f,  1  >  j  >  q  NORTH 
1  >  i  >  0,  1  >  j  >  q  —  1  EAST 
2>i>(f)-l,l>j>q  SOUTH 
1  >  i  >  <t>,  2  >  j  >  q  WEST 


Fig.  15.  The  four  neighbors  of  the  pixel  on  the  regular  grid. 


Using  this  4-neighborhood  definition  the  non-row-normalized  spatial  neighborhood  matrix  C 
of  the  4-by-4  spatial  framework  given  in  Figure  16  can  be  formed  as  shown  in  Figure  17(a).  For 
example,  the  neighbors  of  pixel  6  of  the  spatial  framework  is  represented  in  the  6th  row  of  the 
non-row-normalized  neighborhood  matrix  C  (Figure  17(a))  and  the  neighbors  of  the  other  pixels 
are  represented  in  that  fashion.  The  2nd,  5th,  7th,  and  10th  columns  of  the  6th  row  contains 
value  ”1”  since  the  neighbors  of  the  pixel  6  are  pixels  2  (NORTH),  5  (EAST),  7  (WEST),  and 
10  (SOUTH)  in  the  spatial  framework. 

The  row-normalized  neighborhood  matrix  W  (Figure  17(b))  is  formed  by  dividing  neighboring 
values  by  the  row  sums  of  the  C  matrix.  For  example,  the  6th  row  of  the  C  is  divided  by  4 
which  is  row  sum  of  it. 

B.  Constructing  the  Neighborhood  Matrix  W  on  Irregular  Grid  Space 

Spatial  statistics  requires  some  means  of  specifying  the  spatial  dependence  among  observations 
[17].  The  neighborhood  matrix  i.e.,  W,  spatial  weight  matrix  fulfills  this  role  for  lattice  models 
[5],  [6]  and  can  be  formed  on  both  regular  and  irregular  grid.  This  section  shows  a  way  to  form 
the  neighborhood  matrix  on  the  irregular  grid  space  which  is  based  on  Delaunay  triangulation 
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Fig.  16.  The  spatial  framework  which  is  4-by-4  where  rows  may  or  may  not  be  equal  to  columns. 
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Fig.  17.  (a)  The  4  x  4-by-4  x  4  non-row-normalized  neighborhood  matrix  C  with  4  nearest  neighbors,  (b)  The  row-normalized 
version  i.e.  W  which  is  also  4  x  4-by-4  x  4.  The  product  4  x  4  is  equal  to  16.  the  problem  size  n. 


algorithm  [34],  [35].  [36]  describes  another  method  of  forming  the  neighborhood  matrix  on 
the  irregular  grid  which  is  based  on  nearest  neighbors.  One  specification  of  the  spatial  weight 
matrix  begins  by  forming  the  binary  adjacency  matrix  N  where  N,:l  =  1  when  observation  j  is  a 
neighbor  to  observation  i  (i  /  j).  The  neighborhood  can  be  defined  using  computationally  very 
expensive  Delaunay  triangulation  algorithm  [25].  These  elements  may  be  further  weighted  to  give 
closer  neighbors  higher  weights  and  incorporate  whatever  spatial  information  the  user  desires. 
By  itself,  N  is  usually  asymmetric.  To  insure  symmetry,  we  can  rely  on  the  transformation 
C  =  (N  +  NT)  /2.  The  rest  of  forming  neighborhood  matrix  on  irregular  grid  follows  the  same 
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Fig.  18.  Summary  of  formation  of  W  matrix. 


procedure  discussed  in  the  preceding  section.  Users  often  re-weight  the  adjacency  matrix  to  create 
a  row-normalized  i.e.,  row-stochastic  matrix  or  a  matrix  similar  to  a  row-stochastic  matrix.  This 
can  be  accomplished  in  the  following  way:  Let  D  represent  a  diagonal  matrix  whose  ith  diagonal 
entry  is  the  row-sum  of  the  ith  row  of  matrix  C.  The  matrix  W  =  DI2DI2C  =  D-1C  is 
row-stochastic  where  D12  is  a  diagonal  such  that  its  ith  entry  is  the  inverse  of  the  square 
root  of  the  ith  row  of  matrix  C.  Note  that  the  eigenvalues  of  the  matrix  W  do  not  exceed  1 
in  absolute  value,  and  the  maximum  eigenvalue  equals  1  via  the  properties  of  row-stochastic 
matrices  (see  Lemmas  3.2  and  3.1  and  Figure  18  in  this  study,  §5.13.3  in  [27]). 

From  a  statistical  perspective,  one  can  view  W  as  a  spatial  averaging  operator.  Given  the  vector 
y,  the  row-stochastic  normalization  i.e.,  Wy  results  in  a  form  of  local  average  or  smoothing  of 
y.  In  this  context,  one  can  view  elements  in  the  rows  of  W  as  the  coefficients  of  a  linear  filter. 
(See  [3],  [34],  [35],  [36]  for  more  information  on  spatial  weight  matrices.) 
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Appendix  II 

Moran’s  I  Index:  Quantifying  the  Autocorrelation  in  Datasets 

Spatial  autocorrelation  analysis  tests  whether  the  observed  value  of  a  variable  at  one  locality 
is  independent  of  the  values  of  the  variable  at  neighboring  localities.  If  a  dependence  exists,  the 
variable  is  said  to  exhibit  spatial  autocorrelation.  Spatial  autocorrelation  measures  the  level  of  in¬ 
terdependence  between  the  variables,  and  the  nature  and  strength  of  that  interdependence.  It  may 
be  classified  as  either  positive  or  negative.  In  a  positive  case  all  similar  values  appear  together, 
while  a  negative  spatial  autocorrelation  has  dissimilar  values  appearing  in  close  association  [23]. 

=  n  Er=i  E?=1  (Vijjxi  -  x)(xj  ~  x)) 

s0  E”=iOa-*)2 

The  term  ,S'0  is  equal  to  E"=i  Ey=i  wu  and  n 's  the  problem  size. 
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Appendix  III 

Simple  Overview  of  Log-likelihood  Theory 

We  simply  define  the  likelihood  function  in  this  section.  Let  Y  =  ( Y\  ,Y2, ....  Yn )  be  a 
random  vector,  and  define  a  statistical  model  { ,/y  f y | ^ )  :  0  G  0}  which  is  parameterized  by 
0  =  0n),  the  parameter  vector  in  the  parameter  space  ©.  The  likelihood  function  is  the 

mapping  defined  as  L  :  ©  — >■  [0, 1]  C  M  given  in  (8). 


L{6\y)  =  /v(y|0)  (8) 

In  other  words,  the  likelihood  function  is  functionally  the  same  in  form  as  a  probability  density 
function  (pdf).  However,  the  emphasis  is  changed  from  y  to  0.  The  pdf  is  a  function  of  the  y’s 
while  holding  the  parameters  0’s  constant,  L  is  a  function  of  the  parameters  0’s,  while  holding  the 
y’s  constant.  We  can  abbreviate  L(0|y)  to  L{6).  The  parameter  vector  0  such  that  L(0)  >  L(8) 
for  all  0  G  ©  is  called  maximum  likelihood  estimate,  or  MLE,  of  0.  Many  of  the  density  functions 
are  exponential  in  nature,  therefore  it  is  easier  to  compute  the  MLE  of  a  likelihood  function  L 
by  finding  the  maximum  of  the  natural  log  of  L,  known  as  the  log-likelihood  function  defined 
in  in  (9)  due  to  the  monotonicity  of  the  log  function.  Finding  maximum  of  a  function  is  carried 
by  taking  the  first  derivative  of  that  function  and  Ending  the  values  of  parameters  which  equate 
the  derivative  to  zero. 


f(0|y)  =  ln(L(0|y)) 


(9) 
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Appendix  IV 

Basic  Linear  Algebra  Facts 

This  appendix  section  presents  the  basic  linear  algebra  equalities  [7]  used  in  our  proofs. 

•  A  real  n-  by  -n  matrix  A  is  called  a  Markov  matrix,  or  row-stochastic  matrix  if: 

1)  ctjj  >  0  for  1  <  i,j  <  n; 

2)  i  aij  =  1  for  1  <  i  <  n. 

•  If  A  is  a  Markov  matrix,  then  AJ„  =  J„  where  J„  =  [1,  So,  1  is  always  an 

eigenvalue  of  a  Markov  matrix. 

•  If  A  and  B  are  Markov  matrices,  then  AB  is  also  a  Markov  matrix. 

.  tr{ A)  =  YJL i  a a 

•  YTi= i  ^  =  ^r(A) and  rr=i  ^  =  iai 

.  If: 

1)  xTAx  >  0  then  Vx  /  0  the  matrix  A  is  called  positive  definite. 

2)  xTAx  >  0  then  Vx  the  matrix  A  is  called  positive  semi-definite  matrix. 

•  All  positive  definite  matrices  are  non-singular. 

•  Eigenvectors  of  positive  semi-definite  matrices  are  non-negative. 

•  If  A2  =  A,  then  matrix  A  is  idempotent. 

•  All  idempotent  matrices  are  positive  semi-definite  with  non-negative  diagonal  elements  since 
ATA  =  AAT  =  A  =  A2.  Then,  xTAx  =  (Ax)TAx  which  is  just  a  sum  of  squares  of  the 
elements  of  Ax. 

•  If  the  square  of  an  idempotent  matrix  A  is  non-singular,  then  that  matrix  is  the  identity 
matrix.  Because:  A2  =  A  then  A_1A2  =  A-1A  which  means  A  =  I. 

•  Eigenvalue  of  an  idempotent  matrix  is  either  0  or  1. 

•  If  A  is  positive  definite  (or  positive  semi-definite)  matrix  and  B  is  non-singular  matrix  i.e., 
its  determinant  is  not  zero  then  BTAB  is  also  positive  definite  (or  positive  semi-definite) 
matrix  [7], 


39 


Appendix  V 

Proof  of  Observation  4.1 


Proof:  The  following  are  the  facts  used  in  this  proof: 

1)  Eigenvalues  of  a  Markov  matrix  are  in  the  range  [ — 1 . .  1] . 

2)  Maximum  eigenvalue  of  a  Markov  matrix  is  1. 

3)  abs(ln(l  —  xy))  <  abs(ln(l  +  t)),  if  x  and  y  are  between  0  and  1. 

4)  abs(ln(l  —  e))  >  abs(ln(l  —  e))  for  small  number  0  <  e  <  1 

5)  0  <  p  <  1  for  almost  all  spatial  data  mining  problems  of  interest. 

Now  let  us  use  these  facts  to  rewrite  f/  ln(l  —  pA;).  Let’s  assume  that  we  have  a  sequence 
of  sorted  n  eigenvalues  in  the  descending  order  and  have  l  positive  eigenvalues  and  n  —1  —  1 
negative  eigenvalues.  For  instance,  for  4-neighborhood  structure  l  is  n/2  and  for  8 -neighborhood 
case  l  is  less  than  n/2. 


—2  n 

abs  —  >  ln(l  —  p\j 

'  n 


2  =  1 


-2 


=  abs  T  £  ln(l  —  p  abs(Ai))  +  ^  ln(l  +  p  abs(Aj)) 


2  =  1 


2=Z+1 


< 

< 

< 

< 


-2 


abs  — Vln(l  +  pabs(A,))  due  to  4 

n 


-2 


2=1 


abs  i  —  V  ln(l  +  p)  due  to  3 

\  n 


2=1 


— 2?? 

abs  (  - ln(l  +  p) 


n 


abs  (— 2  ln(l  +  p))  due  to  5 


(10) 


The  quantity  given  in  (10)  is  small  as  long  as  p  <  1.  Now,  let’s  work  on  the  SSE  term  and  show 
that  (SSE)  (log-det). 

We  will  review  eigenvectors  of  symmetric  matrices  [8]  since  our  matrix  (I  —  pW)T(I  —  M)(I  — 
pW)  is  a  symmetric  matrix  which  we  represent  by  matrix  A. 

Suppose  A  G  R nxn  is  symmetric,  i.e.,  A  =  AT.  It  is  the  fact  that  the  eigenvalues  of  the 
matrix  A  are  real.  To  see  this,  suppose  Av  =  Av,  where  v  f  0.  Then,  vTAv  =  AvTv  = 
A^i  (abs  {v,))1  But  also  vTAv  =  AvTv  =  AJ^”=1  (abs  iv,))1 .  So,  we  have  A  =  A,  i.e., 
A  G  R  and  we  can  safely  assume  that  v  G  R"  . 

It  is  the  most  important  fact  that  there  is  a  set  of  orthonormal  eigenvectors  of  matrix  A  i.e., 
qi,  ...,qn  such  that  Aq;  =  A,qi,qiTqj  =  5jj  which  is  zero  and  each  eigenvector  has  unit  length 
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such  that  qiTqi  =  1  or  equivalently  norm  of  each  eigenvector  is  1  1 1  qj  1 1  =  1.  In  matrix  form: 
there  is  an  orthogonal  Q  such  that: 

Q_1AQ  =  QtAQ  =  A  (11) 


Hence  we  can  express  A  as: 


A  =  Q  XAQ  =  ^2  A»qi<JiQ 


(12) 


2=1 


Suppose  the  eigenvalues  of  A  are  sorted  so  Ai  >  ...  >  An.  Please  note  that  A  represents 
(I  -  pW)T(I  -  M)(I  -  pW).  Thus, 


abs  (yrAy) 


abs  (yrQrAQy) 
abs  ((Qry)rA(QTy)) 

abs  ( ^2  Ai(qiTy)2 


< 


abs 

n  * '  E[A?yTy]y 

per  element  SE 

n*  E [X1(y'l  + ...  +  y2n)] 


(13) 


where  E[.]  represents  expected  value  (i.e.  average  of  eigenvalues).  In  the  equation,  all  eigen¬ 
values  greater  symmetric  positive  semi-definite  matrix  are  positive  and  y]  ’s  of  an  order  of 
magnitude  which  makes  the  natural  logarithm  of  this  end-result  much  greater  than  the  log-det 


term. 
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Appendix  VI 

Derivation  of  the  ML  (Log-likelihood)  Function 

Ordinary  least  squares  are  not  appropriate  to  solve  for  the  models  given  in  (2).  One  way 
to  solve  is  to  use  the  ML  theory  procedure.  In  probability,  there  are  essentially  two  classes  of 
problems:  the  first  is  to  generate  a  data  sample  given  a  probability  distribution  and  the  second 
is  to  estimate  the  parameters  of  a  probability  distribution  given  data.  Obviously  in  our  case,  we 
are  dealing  with  the  latter  problem.  This  derivation  not  only  shows  the  link  between  the  need 
for  eigenvalue  computation  and  the  SAR  model  parameter  fitting  but  also  explains  how  the  SAR 
model  works  and  can  be  interpreted  as  an  execution  trace  of  the  solution  for  the  SAR  model. 
The  end-result  will  be  the  log-likelihood  function  that  is  used  in  the  optimization  of  SAR  model 
parameter  estimate  p.  We  presented  a  simple  overview  of  log-likelihood  theory  in  Appendix  III. 

We  begin  the  derivation  by  choosing  a  SAR  model  that  is  given  in  (2).  We  can  explicitly  write 
SAR  model  using  its  matrix-vector  form  as  follows: 


Ut  —  { I  —  pW)  1(xtlpl  +  Xt-2^2  +  •••  +  Xtkpk  +  g)  (14) 

where  t  =  1 , . . . ,  n  is  the  index  for  n  successive  observations.  Let  us  assume  that  the  distur¬ 
bances  or  error  et  is  distributed  normally,  independently  and  identically  with  mean  E(e)  =  0  and 
variance  a2.  The  set  of  n  such  equations  can  be  compiled  as  equation  (2).  Let  us  assume  that 
the  disturbances  et,  which  are  the  elements  of  the  vector  e  =  [f  | , ...,  r,, . ...  r„]  and  are  distributed 
independently  and  identically  according  to  a  normal  distribution  as  given  in  (15).  Let’s  call  the 
matrix  (I  —  pW)  as  matrix  A  to  simplify  the  expressions.  Please  note  that  r,  =  (A yt  —  xL8). 

N{et ;  0,  a2)  =  ^1— _  exp  (A yt  -  xt.p) )  (15) 

If  the  vector  e  has  a  multi-variate  normal  distribution  just  like  in  our  case,  the  normal 
distribution  is  then  defined  as  in  (16)  with  a  covariance  matrix  defined  as  E  =  a2 1.  Please 
note  that  lEl^1  =  cr",  E-1  =  -4 1  and  I E I  =  |cr2I|  =  a2n. 

II  7  I  I  I  I 

N(et;  0,E2)  =  (2tt)^|E|^  exp  (^E-1^) 

=  (27r)^|E|^  exp  -  xt.[3)T  E_1(A  yt  -  xt.(3)^j 


(16) 
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Then,  taking  the  xL  vectors  which  forms  the  rectangular  matrix  x  of  size  n-by-k  as  data,  the 
observations  yt  (where  t  =  1, . n)  have  density  functions  N(yt;  (A//,  —  Xt.0),  a2)  which  are  of 
the  same  form  as  those  of  the  disturbances,  and  the  likelihood  function  of  (3  and  a2,  based  on 
sample  is  given  in  (17)  [18].  Thus,  the  prediction  of  the  SAR  model  solution  heavily  depends 
on  the  quality  of  the  normally  distributed  random  numbers  generated. 

n 

L{6\y)  =  L((p,  j3,  a2)\(yt,  ay.,  W))  =  JJ  N(yt;  (Ayt  -  xt.P),  a2) 

t= i 

=  N(e;  0,  T,2)\de  /  dy\ 

=  (2tt) ^ |S| ^  exp  (^(Ay  -x(3)T  E_1(Ay-x/3)^  \de/dy\ 

=  (2na2)^  exp  ^-^(Ay  -  x/3)T  E_1(Ay  -  x/5)^  \de/dy\  (17) 

The  Jacobian  term  \de/dy\  [12],  [15]  needs  to  be  calculated  out  in  order  to  find  the  probability 
density  function  of  the  variable  y,  which  is  given  in  (18).  Please  note  that  e  =  (Ay  —  x/3)  and 
the  term  (Ay  —  /3)  is  also  known  as  the  vector  of  homoskedastic  random  disturbances  [2], 
[42].  The  Jacobian  term  is  equal  to  the  identity  matrix  I  in  classical  linear  regression  model  [2], 
The  need  for  the  Jacobian  term  is  formally  stated  and  proved  by  Theorem  7.1  (Theorem  6.1 
in  this  paper)  on  pages  232-233  of  [15].  We  provide  the  theorem  and  proof  for  the  reader’s 
convenience  by  converting  to  our  notation. 

\de/dy\  =  |  A  |  (18) 

Theorem  6.1:  Let  N(e;  0,  E2)  be  the  value  of  the  probability  density  of  the  continuous  random 
variable  e  at  et.  Since  the  function  given  by  y  =  A-1x/3  —  A-1e  is  differentiable  and  either 
increasing  or  decreasing  for  all  values  within  the  range  of  e  for  which  Ar(r;  0,  E2)  /  0,  then 
for  these  values  of  e,  the  equation  y  =  A-1x/3  —  A-1e  can  be  uniquely  solved  for  e  to  give 
e  =  Ay  —  x/3  and  the  probability  density  of  y  is  given  by: 

L(Q\y)  =  A(e;0,E2)|de/dy|  provided  A-1x/3  +  A-1e  ^  0  (19) 

Elsewhere,  L(6 |y)  =0. 

Proof:  The  proof  can  be  found  on  pages  233-235  of  [15].  ■ 
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L(0\y)  =  (27ra2)  2"  exp  ^^(Ay  -  x/3)T  (Ay  -  x/3)^  |A|  (20) 

L(8\y)  given  in  (20)  will  henceforth  be  referred  to  as  the  ’’likelihood  function  of  the  SAR 
model”.  It  is  a  probability  distribution  but  now  interpreted  as  a  distribution  of  parameters  which 
have  to  be  calculated  as  noted  in  the  Appendix  III.  Since  the  log  function  is  monotonic  and 
the  log-likelihood  function  is  unimodular  (See  SAR  Unimodularity  Theorem  4.1),  we  can  then 
equivalently  minimize  the  log-likelihood  function,  which  has  a  simpler  form  and  can  handle  large 
numbers.  This  is  because  the  logarithm  is  advantageous,  since  In  (ABC)  =  ln(A)+ln(5)+ln(C). 
After  taking  the  natural  logarithm  of  equation  given  in  (20),  we  get  the  log-likelihood  function 
given  in  (21). 


71  71  "1 

f(0|y)  =  lnL(0|y)  = --ln(2vr)  - -ln(a2)  -  —  (Ay  -  x/3)r  (Ay  -  x/3)  +ln|A|  (21) 

The  MLE  estimators  given  in  (22a)  and  (22b)  are  obtained  by  setting  and  to 

zero  respectively. 


j3  =  (xTx)  *xT  Ay  (22a) 

d2  =  (Ay)T(I  -  x(xTx)“1  xT)T(I  -  x(xTx)“1  xT)(Ay)/n  (22b) 

Replacing  j3  given  in  (22a)  with  /)  given  in  (21)  and  a2  given  in  (22b)  with  a2  given  in 
(21)  lead  to  equation  given  in  (23)  for  the  log-likelihood  function  (i.e.  the  logarithm  of  the  ML 
function)  to  be  optimized  for  p. 

71  71  f  1 

f(p|y)  =  HA|-  -ln(27r)--ln|- 

log—det  s  n/- 

constants 

£  In  (Ay)T(I  -  x(xTx)-1  xT)T(I  -  x(xTx)“1  xT)(Ay)  (23) 

2  ' - v - " 

SSE 

The  hrst  term  in  (23),  i.e.,  the  log-det,  is  nothing  but  the  logarithm  of  the  sum  of  a  collection 
of  scalar  values  including  all  of  the  eigenvalues  of  the  neighborhood  matrix  W  as  given  in  (24). 
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n  n 

|I-pW|  =  nU-pAO^  ln|I-pW|  =  ^ln(l-pA,-)  (24) 

2  =  1  2  =  1 

Hence,  the  final  form  of  the  log-likelihood  function  is  given  in  (25)  after  ignoring  constant 
terms  in  (23)  and  then  multiplying  the  resulting  equation  with  the  constant 

((ply)  =  —  In  |A|  +  ln(Ay)T(I  -  x(xTx)“1  xT)T(I  -  x(xTx)“1  xT)(Ay)  (25) 
n 

Therefore,  the  log-likelihood  function  i{p |y)  is  optimized  using  single-variable  optimization 
routine  Golden  Section  Search  to  find  the  best  estimates  for  p.  Once  the  estimate  for  p  is  found, 
both  (3  and  a2  can  be  computed.  Finally,  the  predicted  variable  (y  vectors  or  thematic  classes) 
can  be  computed  using  equation  (26). 


y=(I-pW)-1(x/3  +  e)  (26) 

Equation  (26)  needs  a  matrix  inversion  algorithm  in  order  to  get  the  predicted  observed 
dependent  variable  y.  For  small  problem  sizes,  one  can  use  exact  matrix  inversion  algorithms; 
however,  for  large  problem  sizes  (e.g.,  >  10K)  one  can  use  geometric  series  expansion  to  compute 
the  inverse  matrix  in  (26)  as  stated  by  Lemma  6.1. 

Lemma  6.1: 

oo 

(I  -  pw)-1  =  J>W)‘ 

2  =  0 

Proof:  Since  ||W||  <  1  and  |p|  <  1,  we  have  that  ||pW||  <  1.  We  then  apply  Lemma  2.3.3 
on  page  58  of  [16].  (Please  also  see  page  301  of  [20]).  ■ 

In  practice,  we  truncate  the  sum  to  at  most  30  terms,  fewer  if  p  is  bounded  away  from  1. 


